HumanEval+
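The pass@1 scores below are presumably the standard unbiased pass@k estimator averaged over the HumanEval+ problems; a minimal sketch, where the sample count n = 10 in the usage line is an assumption, not something stated here:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n samples were drawn and c of them passed the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the per-problem success fraction c / n;
# the benchmark score averages this over all problems.
per_problem = [pass_at_k(10, c, 1) for c in (10, 7, 0)]
```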

pairwise wins
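A pairwise win rate can be computed from per-problem correctness vectors: model a beats model b on a problem a solves and b misses. Counting ties (both solve or both miss) as half a win is an assumption here, not something stated above:

```python
import numpy as np

def win_rate(a: np.ndarray, b: np.ndarray) -> float:
    """Pairwise win rate of model a over model b from per-problem
    correctness vectors (1 = solved). A win is a problem a solves
    and b misses; ties count as 0.5, so win_rate(a, b) and
    win_rate(b, a) always sum to 1."""
    wins = np.sum((a == 1) & (b == 0))
    ties = np.sum(a == b)
    return float(wins + 0.5 * ties) / len(a)

# hypothetical correctness vectors for eight problems
a = np.array([1, 1, 1, 0, 1, 0, 1, 1])
b = np.array([1, 0, 1, 0, 0, 1, 1, 0])
```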

p-values
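A natural significance test for each model pair is an exact McNemar test on the discordant problems (those exactly one of the two models solves); whether this is the test actually behind the p-values here is an assumption. A stdlib-only sketch:

```python
from math import comb

def mcnemar_exact_p(n01: int, n10: int) -> float:
    """Two-sided exact McNemar test. n01 / n10 count problems solved
    by only the first / only the second model; under H0 the discordant
    problems split 50/50, so p doubles the smaller binomial tail."""
    n = n01 + n10
    k = min(n01, n10)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)  # doubling can exceed 1 when n01 == n10
```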

result table

rank  model                                  pass@1  win_rate       Elo
   0  opencodeinterpreter-ds-33b              0.738     0.777  1248.264
   1  meta-llama-3-70b-instruct               0.720     0.754  1226.816
   2  mixtral-8x22b-instruct-v0.1             0.720     0.743  1213.711
   3  HuggingFaceH4--starchat2-15b-v0.1       0.713     0.743  1214.953
   4  deepseek-coder-7b-instruct-v1.5         0.713     0.742  1213.325
   5  opencodeinterpreter-ds-6.7b             0.701     0.715  1186.565
   6  xwincoder-34b                           0.695     0.706  1177.236
   7  speechless-coder-ds-6.7b                0.659     0.653  1131.674
   8  code-llama-70b-instruct                 0.659     0.647  1125.044
   9  white-rabbit-neo-33b-v1                 0.659     0.646  1124.281
  10  speechless-starcoder2-15b               0.628     0.597  1084.457
  11  bigcode--starcoder2-15b-instruct-v0.1   0.604     0.557  1048.541
  12  microsoft--Phi-3-mini-4k-instruct       0.591     0.548  1042.756
  13  Qwen--Qwen1.5-72B-Chat                  0.591     0.539  1035.047
  14  code-13b                                0.524     0.442   955.455
  15  speechless-starcoder2-7b                0.518     0.427   942.880
  16  codegemma-7b-it                         0.518     0.420   938.507
  17  speechless-coding-7b-16k-tora           0.506     0.409   927.939
  18  code-33b                                0.494     0.392   912.610
  19  open-hermes-2.5-code-290k-13b           0.488     0.382   903.425
  20  starcoder2-15b-oci                      0.433     0.307   836.395
  21  codegemma-7b                            0.415     0.306   834.815
  22  mixtral-8x7b-instruct                   0.396     0.266   796.939
  23  mistralai--Mistral-7B-Instruct-v0.2     0.360     0.224   753.146
  24  gemma-1.1-7b-it                         0.354     0.203   727.874
  25  octocoder                               0.329     0.195   721.395
  26  python-code-13b                         0.305     0.160   675.949
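The Elo column looks like a joint fit rather than a per-model transform: 400 * log10(w / (1 - w)) + 1000 applied to each win_rate does not reproduce the values above, so the ratings are presumably fit to the full pairwise win matrix, Bradley-Terry style. A gradient-ascent sketch; the 400/ln 10 scale and the mean-1000 anchoring are assumptions:

```python
import numpy as np

def elo_from_wins(W: np.ndarray, base=1000.0, scale=400.0,
                  iters=2000, lr=0.5) -> np.ndarray:
    """Fit Bradley-Terry ratings to a pairwise win matrix W
    (W[i, j] = wins of i over j, fractional wins allowed) by gradient
    ascent on the log-likelihood, then map to the Elo convention."""
    n = W.shape[0]
    r = np.zeros(n)                      # log-odds ratings
    games = W + W.T                      # games played per pair
    for _ in range(iters):
        # p[i, j] = P(i beats j) = sigmoid(r_i - r_j)
        p = 1.0 / (1.0 + np.exp(r[np.newaxis, :] - r[:, np.newaxis]))
        grad = (W - games * p).sum(axis=1)
        r += lr * grad / np.maximum(games.sum(axis=1), 1.0)
        r -= r.mean()                    # fix the additive gauge freedom
    return base + (scale / np.log(10.0)) * r

# two models with a 3-1 head-to-head record: the MLE rating gap
# is 400 * log10(3) Elo points, split symmetrically around base
elo = elo_from_wins(np.array([[0.0, 3.0], [1.0, 0.0]]))
```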